Group 2
2024-04-22
- Descriptive Statistics: Analyzed engagement statistics including views, likes, and comments.
- Performance and Engagement Metrics: Compared engagement metrics across categories, video duration, and tags.
- Content Analysis: Used topic modeling and keyword extraction to identify themes within the content.
- Trend Analysis: Examined trends in viewership and engagement over time.
Tools Used:
- Data Extraction and Parsing: tuber, httr, jsonlite, tidyverse, skimr, recipes, dplyr, tidyr, lubridate
- Machine Learning: h2o
- Visualization: plotly, ggplot2
- The data was extracted with the tuber package, an R client for the
YouTube Data API.
- The dataset is a collection of TEDx talks from the TEDx YouTube
channel, featuring talks aimed at inspiring, educating, and sparking
discussion on a range of subjects.
- Each entry includes details such as the video ID, publication time,
title, description, tags, category ID, default audio language, duration,
dimension, caption availability, licensed content status, view count,
like count, favorite count, and comment count.
- The dataset offers insights into the content and engagement metrics of
these TEDx talk videos, showcasing diverse topics and audience
responses.
## Rows: 20,268
## Columns: 12
## $ Utc_Day_Part <chr> "Afternoon", "Afternoon", "Afternoon", "Afterno…
## $ Month <chr> "March", "March", "March", "March", "March", "M…
## $ Day_Of_Week <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tu…
## $ Title <chr> "The Great Diffusion | Alex Lazarow | TEDxSonom…
## $ Description <chr> "Over the last 150 years, unprecedented technol…
## $ Tags <chr> "Business,Economics,English,Entrepreneurship,Fu…
## $ Duration_Minutes <dbl> 10, 12, 11, 16, 16, 7, 6, 10, 12, 10, NA, 11, 1…
## $ Default_Audio_Language <chr> "en", "en", "en", "en", "en", "en", "pl", "pl",…
## $ Caption <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
## $ View_Count <dbl> 77, 71, 313, 62, 180, 347, 119, 179, 90, 27419,…
## $ Like_Count <dbl> 3, 2, 13, 0, 10, 17, 4, 4, 3, 72, 1500, 18, 12,…
## $ Comment_Count <dbl> 0, 0, 4, 0, 0, 15, 1, 1, 0, 40, 49, 3, 0, 0, 5,…
- Binned “Published_Time” into UTC day parts: Night, Morning, Afternoon,
Evening
- Extracted day of the week from “Published_Time”
- Extracted month from “Published_Time”
- Extracted minutes from “Duration”
- Pre-processed “Tags” column by removing unnecessary keywords
- Removed low variance columns
- Excluded videos from the last 3 weeks
  - Calculated the cutoff date, 3 weeks before the max date
  - Filtered out rows published after that cutoff date
- Separated data for text and non-text models
- Converted categorical variables to factors using the recipes package
(recipe and bake)
- Saved the post-processed data to an .rds file
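The date and duration steps above can be sketched roughly as follows. This is a minimal illustration, not the group's actual pipeline: the sample rows, the day-part hour boundaries, the rounding of durations to whole minutes, the helper `iso_to_minutes()`, and the output file name are all assumptions made for the example. Only the column names and the 3-week window come from the list above.

```r
library(dplyr)
library(lubridate)

# Hypothetical raw rows; real data would come from the YouTube Data API.
raw <- data.frame(
  Published_Time = c("2024-03-26T14:05:00Z", "2024-03-01T02:30:00Z",
                     "2024-01-15T19:45:00Z"),
  Duration = c("PT10M12S", "PT1H2M3S", "PT45S")
)

# Hypothetical helper: convert an ISO 8601 duration such as "PT1H2M3S"
# to total minutes, rounded to a whole number (rounding is an assumption).
iso_to_minutes <- function(x) {
  clean <- ifelse(is.na(x), "", x)
  part <- function(unit) {
    m <- regmatches(clean, regexec(paste0("([0-9]+)", unit), clean))
    vapply(m, function(g) if (length(g) == 2) as.numeric(g[2]) else 0,
           numeric(1))
  }
  out <- round(part("H") * 60 + part("M") + part("S") / 60)
  out[is.na(x)] <- NA_real_
  out
}

features <- raw %>%
  mutate(
    published = ymd_hms(Published_Time, tz = "UTC"),
    # Assumed boundaries: Night 0-5, Morning 6-11, Afternoon 12-17, Evening 18-23
    Utc_Day_Part = case_when(
      hour(published) < 6  ~ "Night",
      hour(published) < 12 ~ "Morning",
      hour(published) < 18 ~ "Afternoon",
      TRUE                 ~ "Evening"
    ),
    Month = as.character(month(published, label = TRUE, abbr = FALSE)),
    Day_Of_Week = as.character(wday(published, label = TRUE, abbr = FALSE)),
    Duration_Minutes = iso_to_minutes(Duration)
  )

# Drop videos published in the 3 weeks before the most recent video.
cutoff <- max(features$published) - weeks(3)
kept <- filter(features, published <= cutoff)

# Hypothetical output path for the post-processed data.
saveRDS(kept, file.path(tempdir(), "tedx_processed.rds"))
```

Excluding the most recent videos is a common way to avoid comparing talks that have not yet had time to accumulate views, likes, and comments against older ones.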
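# Plot the per-topic term probabilities (beta) from the topic model,
# one facet per topic
library(ggplot2)

ggplot(
  word_probs,
  aes(term, beta, fill = as.factor(topic))
) +
  geom_col(show.legend = FALSE) +          # the facet already identifies the topic
  facet_wrap(~ topic, scales = "free") +   # free scales: each topic has its own terms
  coord_flip()                             # horizontal bars keep term labels readable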
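The plot above expects `word_probs` to have one row per topic and term, with columns `topic`, `term`, and `beta`. The sketch below builds a toy table of that shape so the plot can be tried in isolation. The terms and beta values are invented purely for illustration and are not results from the TEDx data. Wrapping `term` in `reorder()` is an optional addition that sorts the bars by beta; it only orders cleanly here because no term appears in more than one topic.

```r
library(ggplot2)

# Toy stand-in for the topic model output; values are invented for illustration.
word_probs <- data.frame(
  topic = c(1, 1, 1, 2, 2, 2),
  term  = c("business", "economy", "startup", "health", "mind", "sleep"),
  beta  = c(0.05, 0.03, 0.02, 0.06, 0.04, 0.01)
)

p <- ggplot(
  word_probs,
  aes(reorder(term, beta), beta, fill = as.factor(topic))
) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  labs(x = "term", y = "beta")

print(p)
```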